QCDSP: The first 64 nodes
We present a summary of the progress on QCDSP in the last year. QCDSP, Quantum Chromodynamics on
Digital Signal Processors, is an inexpensive computer being built at Columbia that can achieve 0.8 teraflops for 
three million dollars. 



1. INTRODUCTION 

QCDSP (Quantum Chromodynamics on Digi- 
tal Signal Processors) is the name of the single 
precision, Digital Signal Processor (DSP) based 
computer being built at Columbia. DSPs are essentially
general purpose microprocessors, whose
design has been optimized for inexpensive float- 
ing point power. The project was begun in April, 
1993 as an inexpensive way to obtain a machine 
that would give almost teraflop scale performance 
for three million dollars. In particular, for this 
price a machine with 16,384 processing nodes, 
connected as a 16^3 x 4 four-dimensional hypercubic
array with a nearest neighbor communications
network, would have a peak speed of 0.8
teraflops.
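
As a quick check of these figures, the short sketch below recomputes the node count from the lattice dimensions and the peak speed from the 50 Mflop rating of a single DSP. The variable names are illustrative only and do not come from the QCDSP software.

```python
from math import prod

# Lattice dimensions of the full machine: 16 x 16 x 16 x 4.
dims = (16, 16, 16, 4)
nodes = prod(dims)                      # 16,384 processing nodes

# Peak rating of one TMS320C31 DSP in Mflops (figure taken from the text).
mflops_per_node = 50

peak_tflops = nodes * mflops_per_node / 1e6
print(nodes, peak_tflops)               # 16384 0.8192
```

The quoted 0.8 teraflops is therefore the rounded product of 16,384 nodes and 50 Mflops per node.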

Each processing node of QCDSP contains a 
50 Mflop Texas Instruments TMS320C31 DSP, 2 
MBytes of DRAM and a custom Application Spe- 
cific Integrated Circuit (ASIC), called the Node 
Gate Array (NGA). The NGA was designed at 
Columbia and provides three major functional- 
ities. First, it contains a programmable fetch- 
ahead memory buffer to increase the effective 
bandwidth of the DRAM for regular patterns of 
memory access. Second, it handles DRAM refresh
and single bit error detection and correction.
Third, it provides a 25 or 50 MHz (the speed is
programmable) bit serial communication link to
each of the eight nearest neighbor processing
nodes. This functionality
of the NGA is handled by the serial communi- 
cations unit (SCU). A fully assembled daughter 
board costs about $120 in quantity. 

The initial design for QCDSP was described at
Lattice '93 [1]; details of the processing nodes
and the NGA design and features at Lattice '94 [2];
and final NGA design issues, the networks in QCDSP
and some details of the global architecture at
Lattice '95 [3]. Since Lattice '95, the design for our
processing nodes, motherboards, backplanes and 
crates has been completed and the design trans- 
lated into working hardware, demonstrated at 
Lattice '96 in the form of a working 64 proces- 
sor machine. In this paper, we will describe some 
of the milestones in this process, give more de- 
tails about the hardware architecture of the full 
machine and discuss our evolving software envi- 
ronment. 

2. GLOBAL ARCHITECTURE 

Most of the processing nodes described in the 
introduction are realized on a roughly 1.8" by 
2.7" printed circuit board. We refer to these as 
daughter boards. Our motherboards are roughly 
14.5" by 20.5", 10 layer printed circuit boards. A 
motherboard holds 63 daughter boards attached 
to it through SIMM connectors. (These daugh- 
ter boards can be replaced in under a minute if 
one fails.) A 64th processing node, referred to as 
node 0, is soldered directly to the motherboard. 
A motherboard contains a 4 x 4 x 2 x 2 configuration
of processors, with the first dimension periodic
on the motherboard. Eight motherboards
are included in a crate and four crates are stacked 
into a rack. 

In any numerical calculation, all the processing
nodes behave similarly. Node 0 on each
motherboard, however, has capabilities that the daughter
boards do not. While the daughter boards have
2 MBytes of local DRAM, node 0 has 8 MBytes.
Node 0 also has special functionality that is most
apparent at boot time, during hardware testing 
and while doing I/O. It controls the two indepen- 
dent SCSI buses that can be attached to a moth- 
erboard, the connections of the DSP serial net- 
work described below, access to the single PROM 
on the motherboard and the electronics driving 
the off-board physics network connections. 

In addition to the four-dimensional nearest 
neighbor network described in the introduction, 
each motherboard has a DSP serial network [3].
Figure 1 shows these two networks, plus an example
of a SCSI tree connection which links motherboards
to each other and the host workstation.
The SCSI network allows the host workstation to 
communicate with node 0 on any motherboard.
The DSP serial network allows node 0 on a motherboard
to broadcast to all 63 daughter boards or
read and write to one daughter board at a time. 
The SCSI tree plus DSP serial network provide 
a reliable, relatively slow network for boot-time 
I/O and hardware verification. 

3. HARDWARE COMPONENTS 

3.1. PROCESSING NODES 

There are seven surface mounted integrated circuit
chips on a daughter board: one DSP, one
NGA and five DRAM chips. The NGA can drive
two different types of physical chips (either 512k
by 8 bit DRAMs or 256k by 16 bit DRAMs) to
help insulate us from changes in the available con- 
figurations of memory. From a programmer's per- 
spective, the type of physical memory used is ir- 
relevant. 

We took delivery of the first 10 samples of the 
gate array in late July 1995. Within a week, we 
had assembled 4 working daughter boards. By 
far the largest problem we encountered was doing
the surface mount soldering on the fine pitch
(pin spacing of 0.020") NGA. In March 1996, we 
had 131 daughter boards commercially assem- 
bled. Using test software we developed, the 10 
daughter boards which had assembly problems 
were fixed in about 45 minutes by the assemblers. 
This example lends credence to the idea of hav- 
ing thousands of daughter boards commercially 
assembled with few failures. 

3.2. MOTHERBOARDS 

The initial sketch of a motherboard in [1] is
surprisingly similar to the final result; surface 
mounting node 0 directly to the board is the one
major change. During the last year, the moth- 
erboard circuitry was finalized and the resulting 
10 layer printed circuit board was laid out at 
Columbia. The first two motherboards arrived 
in December and were hand assembled by us. By 
mid-January, all the peripheral systems associ- 
ated with node zero were working. The remain- 
der of the systems were debugged by mid-March, 
after the arrival of enough daughter boards to 
fully populate the motherboard. The mother- 
board displayed at Lattice '96 was the first one 
completely assembled. 

3.3. BACKPLANES AND CRATES 

During the spring and summer of 1995, the 
specification of the backplane electronics was fi- 
nalized. The backplane design includes an equal- 
time, 50 MHz clock fanout to all the crates in the 
machine. This is important since nearest neigh- 
bors in the four dimensional network will gener- 
ally be in different crates. An equal time reset sig- 
nal is also distributed. Even though the machine 
is self-synchronizing on nearest neighbor links, 
it can be useful for debugging purposes to have 
all processors come up from reset synchronously. 
Another vital reason for a synchronous reset is to
ensure that all nodes agree on the correct phase for a 25
MHz signal derived from the 50 MHz clock. When
the SCUs are running at 25 MHz, all nodes must 
agree on this signal. 
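
The phase ambiguity can be illustrated with a toy model. The sketch below assumes the 25 MHz signal comes from a simple divide-by-two of the 50 MHz clock that starts toggling when reset is released; the actual divider circuit is not described here, and the function name and arguments are hypothetical.

```python
def divided_clock(reset_release_cycle, n_cycles):
    """Toy divide-by-two of the 50 MHz clock (an assumed model).

    The output is held at 0 during reset and toggles on every 50 MHz
    cycle from the release of reset onward. Returns the output level
    at each 50 MHz cycle.
    """
    level = 0
    trace = []
    for cycle in range(n_cycles):
        if cycle >= reset_release_cycle:
            level ^= 1
        trace.append(level)
    return trace

# Two nodes released from reset on the same 50 MHz cycle agree in phase.
a = divided_clock(reset_release_cycle=3, n_cycles=12)
b = divided_clock(reset_release_cycle=3, n_cycles=12)
in_phase = a[4:] == b[4:]

# A node released one cycle later ends up with the opposite phase.
c = divided_clock(reset_release_cycle=4, n_cycles=12)
opposite = all(x != y for x, y in zip(a[4:], c[4:]))

print(in_phase, opposite)   # True True
```

In this model a skew of a single 50 MHz cycle in the release of reset is enough to invert the derived 25 MHz signal on one node relative to another, which is why an equal time reset matters when the SCUs run at 25 MHz.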

The backplane also handles the distribution of 
three global interrupts linking all nodes. One 
interrupt gives a global synchronization signal, 
the second a non-recoverable error signal and the 
third a recoverable error signal.

The backplane circuit boards were laid out and 
assembled by Bustronics. The crates were de- 
signed and built by Elma Corporation, with the 
first single crate arriving in December 1995. The 
racks of four crates are currently under development.
Each crate requires a separate 20 amp,
220 volt circuit and a single crate is cooled by
a tray of muffin fans. The four crates that are
stacked into a rack will have water-cooled heat
exchangers between adjacent crates.

For a given physical configuration of the crates, 
the four dimensional configuration of QCDSP is 
determined by external cables connected to the 
backplane. In particular, the cabling can produce 
a machine of dimension 4 x 4i x 2j x 2k where i,
j and k are integers. Two and three dimensional 
arrays are also possible by only altering the ex- 
ternal cabling. 
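
The sketch below illustrates how such configurations can be enumerated. It takes the machine dimensions to be 4 x 4i x 2j x 2k, that is, integer multiples of the 4 x 4 x 2 x 2 motherboard, and it assumes periodic wrap-around in every direction when listing the eight nearest neighbors of a node. The function names are hypothetical and the wrap-around is an assumption, not a statement about the actual cabling.

```python
from math import prod

MOTHERBOARD = (4, 4, 2, 2)   # processors per motherboard, from the text

def machine_dims(i, j, k):
    """Dimensions 4 x 4i x 2j x 2k obtained by cabling i*j*k motherboards."""
    return (4, 4 * i, 2 * j, 2 * k)

def neighbors(site, dims):
    """The eight nearest neighbors of a site, assuming periodic wrap-around."""
    result = []
    for axis, extent in enumerate(dims):
        for step in (+1, -1):
            moved = list(site)
            moved[axis] = (moved[axis] + step) % extent
            result.append(tuple(moved))
    return result

dims = machine_dims(4, 8, 8)               # (4, 16, 16, 16)
boards = prod(dims) // prod(MOTHERBOARD)
print(dims, prod(dims), boards)            # (4, 16, 16, 16) 16384 256
print(neighbors((0, 0, 0, 0), dims)[:2])   # [(1, 0, 0, 0), (3, 0, 0, 0)]
```

With i = 4 and j = k = 8 this reproduces the 16,384 node machine of the introduction, built from 256 motherboards. Along a direction of extent 2 the two neighbors returned coincide, although each node still has eight physical links.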

4. SOFTWARE 

4.1. 2 NODE PROTOTYPE 

A rudimentary operating system was writ- 
ten for a 2 node machine, connected to a host 
SUN computer through a commercial SBUS DSP 
board. This operating system provides read, 
write and execute access to the prototype. It has 
proved very useful for code development and testing.
The members of the collaboration from SCRI
have ported much of their macro-based QCD code 
to this 2 node prototype, providing additional 
testing of the hardware. A quenched update code 
has been written at Columbia and tested on the 
prototype. 

4.2. SOFTWARE FOR HARDWARE 
TESTING 

Substantial effort has gone into this area. After
the first daughter boards were assembled, they were
subjected to a variety of tests for reliability, correct
numerical results and power consumption.
Concurrent with the assembly of the first two 
motherboards, software was written to test each 
system as it was completed. This software will 
become part of the standard boot-time hardware 
checking done on QCDSP. 

We have already mentioned the automated
testing software used to check the commercial
daughter board assembly. This was so successful 
that we are currently packaging our motherboard 
hardware test software into a series of diagnostic 
and functional tests to be run at the assemblers. 
After daughter boards are assembled and tested, 
they will be inserted into a motherboard and the 
entire assembly tested. We expect this will result
in fully populated and completely functional
boards being shipped to Columbia.

4.3. QCDSP OPERATING SYSTEM 
SOFTWARE 

We have completed the most basic level of the 
QCDSP operating system, called the boot kernels.
The boot kernels run after machine
reset and provide a robust I/O path to all nodes
in QCDSP. These kernels fit into the 2k words of 
DSP internal memory and utilize only the DSP 
serial network and the SCSI tree. This reliance 
on a minimal amount of working hardware allows 
us to do much hardware debugging through soft- 
ware. 

The boot kernel on node 0 includes a fully functional
SCSI driver as well as control features to
switch data to various daughter boards. We have 
standardized a recursive routing protocol which 
can handle an arbitrary number of layers in the 
SCSI tree. We have used the boot kernels to com- 
municate with a second motherboard and a disk 
from the first motherboard. 
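
The idea of recursive routing through a tree of arbitrary depth can be sketched as follows. This is a minimal illustration under stated assumptions, not the QCDSP protocol itself: the class, the use of a list of SCSI IDs as the route, and the payload format are all hypothetical. Each node 0 either accepts a message whose route is empty or strips the first hop and passes the remainder to the corresponding child.

```python
class Board:
    """Hypothetical stand-in for node 0 of a motherboard in the SCSI tree."""

    def __init__(self, name):
        self.name = name
        self.children = {}     # hypothetical SCSI ID -> child Board
        self.received = []

    def attach(self, scsi_id, child):
        """Hang a child motherboard off this one and return the child."""
        self.children[scsi_id] = child
        return child

    def route(self, path, payload):
        """Deliver payload here if path is empty; otherwise strip the
        first hop and hand the rest of the path to that child."""
        if not path:
            self.received.append(payload)
            return self.name
        return self.children[path[0]].route(path[1:], payload)

# A three-layer tree: mb0 -> mb1 -> mb2.
root = Board("mb0")
mb1 = root.attach(2, Board("mb1"))
mb2 = mb1.attach(5, Board("mb2"))

print(root.route([2, 5], "boot image"))   # mb2
```

Because every layer applies the same rule, adding another layer to the tree only lengthens the route and needs no new code on any board.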

On our host SUN, we have an X-windows inter- 
face to the machine, which calls a custom SCSI 
driver to handle I/O to QCDSP. C programs 
compiled with the standard Texas Instruments 
tools can be loaded and run and the contents of 
memory on any node of QCDSP returned to the 
screen. The boot kernels, coupled with the inter- 
face, provide full read, write and execute capabil- 
ity on any node or collection of nodes. At SCRI, 
a command line interface has also been written. 

The run kernels are currently being developed. 
They will include all the features of the boot ker- 
nels, but will be larger since DRAM will be avail- 
able when they are started. They will allow for 
I/O (such as printf) to the host, access to the disk 
system, interrupt handlers for machine errors and 
simple access to various hardware components. 

4.4. PHYSICS SOFTWARE

The optimized versions of the staggered and 
Wilson conjugate gradient codes written when we 
were designing the NGA have run on single node, 
2 node and 64 node physical machines. In the 
last year, a quenched evolution code has been written
at Columbia. A dynamical fermion update code
and gauge fixing code are currently under development.
In addition, the macro-based QCD code
used by collaborators at SCRI has been ported 
to QCDSP. 

We have purchased a commercial C++ com- 
piler specifically for the Texas Instruments DSP 
from Tartan, Inc. This compiler includes dou- 
ble precision arithmetic libraries. These will run 
quite slowly since the hardware is single preci- 
sion, but will allow us to easily check code for 
stability under increased precision. We hope to 
develop C++ classes which will hide the paral- 
lel nature of the machine from the user. These 
should be useful in the parts of programs that 
are not floating-point intensive. 

5. CONCLUSIONS 

In the last year we have built, or had built, 
a few of each hardware component of QCDSP. 
Shortly after Lattice '96, we successfully ran 
physics programs on a 2 motherboard, 128 node 
machine. A fully populated 512 node, 25 Gflop 
machine should be finished by the end of Septem- 
ber. All the parts are on hand for a 2048 node, 
100 Gflop machine, which should be completed 
by the end of November. At that time we expect 
to begin to purchase components and have them 
assembled for a 400 Gflop machine (8192 nodes) 
to be completed by the spring of 1997. 
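
The machine sizes quoted above follow from the packaging described in Section 2 (64 nodes per motherboard, eight motherboards per crate, four crates per rack) and the 50 Mflop peak rating per node. The sketch below recomputes them; the constant and function names are illustrative only.

```python
NODES_PER_MOTHERBOARD = 64
MOTHERBOARDS_PER_CRATE = 8
CRATES_PER_RACK = 4
MFLOPS_PER_NODE = 50

def describe(nodes):
    """Return (motherboards, crates, racks, peak Gflops) for a node count."""
    boards = nodes // NODES_PER_MOTHERBOARD
    crates = boards / MOTHERBOARDS_PER_CRATE
    racks = crates / CRATES_PER_RACK
    gflops = nodes * MFLOPS_PER_NODE / 1000
    return boards, crates, racks, gflops

for nodes in (128, 512, 2048, 8192):
    print(nodes, describe(nodes))
```

On these numbers the 512 node machine is one full crate (about 25.6 Gflops peak), the 2048 node machine is one full rack (about 102.4 Gflops) and the 8192 node machine is four racks (about 409.6 Gflops), consistent with the rounded figures in the text.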

Figure 1. A diagram of the various communications
paths in QCDSP. Here the nodes are
depicted as located in a two-dimensional mesh
(for ease of exposition) rather than in the
four-dimensional mesh actually implemented. The
filled circles represent processing nodes, the thin 
straight lines the four-dimensional nearest neigh- 
bor network, the thin dashed lines the DSP serial 
network on the motherboards and the thick dot- 
ted lines the connections in the SCSI tree. The 
thick solid lines making up boxes give the physical 
boundary of the motherboard. 



